Word wrap

In text display, line wrap is the feature of continuing on a new line when a line is full, such that each line fits in the viewable window, allowing text to be read from top to bottom without any horizontal scrolling.

Word wrap is the additional feature, found in most text editors, word processors, and web browsers, of breaking lines between words rather than within them, except when a single word is longer than a line.

A soft return is the break resulting from line wrap or word wrap, whereas a hard return is an intentional break, creating a new paragraph.

Similarly, a hard wrap inserts actual line breaks in the text at wrap points, whereas a soft wrap puts the text into separate lines without inserting line breaks.

Soft wrapping lets line lengths adjust automatically when the width of the user's window or the margin settings change. It is a standard feature of most modern text editors, word processors, and email clients.
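The difference between hard and soft wrapping can be illustrated with Python's standard textwrap module. This is only a sketch: the sample sentence and the widths 20 and 30 are arbitrary choices, and a real editor would measure rendered glyph widths rather than count characters.

```python
import textwrap

text = "The quick brown fox jumps over the lazy dog"

# Hard wrap: newline characters are inserted into the text itself.
hard_wrapped = textwrap.fill(text, width=20)

# Soft wrap: the stored text is left untouched; only the display lines are
# computed, and they can be recomputed whenever the window width changes.
narrow_view = textwrap.wrap(text, width=20)
wide_view = textwrap.wrap(text, width=30)

print(repr(hard_wrapped))
print(narrow_view)
print(wide_view)
```

After a hard wrap the line breaks are part of the text and stay where they are when the window is resized, whereas the soft-wrapped views are derived from the same unbroken string each time.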

Word boundaries, hyphenation, and hard spaces

Soft returns are usually placed after the ends of complete words, or after the punctuation that follows complete words. However, word wrap may also occur following a hyphen inside a word. This is sometimes not desired, and can be blocked by using a non-breaking hyphen, or hard hyphen, instead of a regular hyphen.

A word without hyphens can be made wrappable by having soft hyphens in it. When the word isn't wrapped (i.e., isn't broken across lines), the soft hyphen isn't visible. But if the word is wrapped across lines, this is done at the soft hyphen, at which point it is shown as a visible hyphen on the top line where the word is broken. (In the rare case of a word that is meant to be wrappable by breaking it across lines but without making a hyphen ever appear, a zero-width space is put at the permitted breaking point(s) in the word.)

Sometimes word wrap is undesirable between adjacent words. In such cases, word wrap can usually be blocked by using a hard space or non-breaking space between the words, instead of regular spaces.
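In Unicode text these characters are U+2011 (non-breaking hyphen), U+00AD (soft hyphen), U+200B (zero-width space), and U+00A0 (no-break space). The following minimal sketch shows how a wrapper might locate the permitted break points described above. The function name and the simplified rule set are assumptions made for illustration; real line breaking follows considerably more detailed rules.

```python
SOFT_HYPHEN = "\u00ad"           # invisible unless the word is broken here
NON_BREAKING_HYPHEN = "\u2011"   # looks like a hyphen, never a break point
NO_BREAK_SPACE = "\u00a0"        # looks like a space, never a break point
ZERO_WIDTH_SPACE = "\u200b"      # invisible break point, no hyphen shown

# Characters after which a line is allowed to end (simplified assumption).
BREAK_AFTER = {" ", "-", SOFT_HYPHEN, ZERO_WIDTH_SPACE}

def split_at_break_opportunities(text):
    """Split text into chunks; a line may end only at a chunk boundary."""
    chunks, current = [], ""
    for ch in text:
        current += ch
        if ch in BREAK_AFTER:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks

print(split_at_break_opportunities("well-known"))
print(split_at_break_opportunities("well\u2011known"))
print(split_at_break_opportunities("10\u00a0km away"))
```

The regular hyphen yields two chunks, so the word may be broken after it. The non-breaking hyphen and the no-break space keep their neighbours in a single chunk.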

Word wrapping in text containing Chinese, Japanese, and Korean

In Chinese, Japanese, and Korean, each Han character is normally considered a word, and therefore word wrapping can usually occur before and after any Han character.

Under certain circumstances, however, word wrapping is not desired. For instance,

Most existing word processors and typesetting software cannot handle either of the above scenarios.

CJK punctuation may or may not follow rules similar to the above-mentioned special circumstances; this is determined by the line breaking rules for CJK text.

One special case of the CJK line breaking rules, however, always applies: line wrap must never occur inside the CJK dash or the CJK ellipsis. Even though each of these punctuation marks must be represented by two characters due to a limitation of all existing character encodings, each of them is intrinsically a single punctuation mark that is two ems wide, not two one-em-wide punctuation marks.
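The rule for the dash and the ellipsis can be sketched as follows. This is a simplified illustration, not a full CJK line breaker: it assumes every character occupies one cell, allows a break between any two characters, and only protects the doubled dash (U+2014 U+2014) and the doubled ellipsis (U+2026 U+2026). The function name wrap_cjk is hypothetical.

```python
# Two-character marks that must stay on one line: the CJK dash and ellipsis.
UNBREAKABLE = ("\u2014\u2014", "\u2026\u2026")

def wrap_cjk(text, width):
    """Wrap text to `width` characters without splitting a protected pair."""
    # First group the text into units that may not be broken internally.
    units, i = [], 0
    while i < len(text):
        if text[i:i + 2] in UNBREAKABLE:
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(text[i])
            i += 1
    # Then fill lines greedily, unit by unit.
    lines, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) > width:
            lines.append(current)
            current = ""
        current += unit
    if current:
        lines.append(current)
    return lines

for line in wrap_cjk("一二三\u2026\u2026四五", 4):
    print(line)
```

Cutting the same string blindly after every fourth character would leave one half of the ellipsis at the end of the first line and the other half at the start of the second.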

Algorithm

Word wrapping is an optimization problem. Different algorithms are used depending on what is being optimized.

Minimum length

A simple way to do word wrapping is to use a greedy algorithm that puts as many words on a line as possible, then moves on to the next line and does the same, until there are no more words left to place. This method is used by many modern word processors, such as OpenOffice.org Writer and Microsoft Word. The algorithm is optimal in that it always puts the text on the minimum number of lines. The following pseudocode implements it:

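SpaceLeft := LineWidth
for each Word in Text
    // SpaceLeft already excludes the space that follows the previous word,
    // so only the width of Word itself has to fit
    if Width(Word) > SpaceLeft
        insert line break before Word in Text
        SpaceLeft := LineWidth - (Width(Word) + SpaceWidth)
    else
        SpaceLeft := SpaceLeft - (Width(Word) + SpaceWidth)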

Here LineWidth is the width of a line, SpaceLeft is the remaining width on the current line, SpaceWidth is the width of a single space character, Text is the input text to iterate over, and Word is a word in this text.
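The greedy procedure can be sketched in Python as follows. The function name greedy_wrap and its parameters are illustrative assumptions, and word widths are measured with len, which corresponds to a fixed-width font where every character is one unit wide.

```python
def greedy_wrap(text, line_width, space_width=1, width=len):
    """Greedy word wrap: put as many words on each line as will fit."""
    lines, current, space_left = [], [], line_width
    for word in text.split():
        # A word that is not first on its line also needs a separating space.
        needed = width(word) + (space_width if current else 0)
        if current and needed > space_left:
            # The word does not fit: close this line and start a new one.
            lines.append(" ".join(current))
            current, space_left = [], line_width
            needed = width(word)
        current.append(word)
        space_left -= needed
    if current:
        lines.append(" ".join(current))
    return lines

print(greedy_wrap("aaa bb cc ddddd", 6))
```

A word wider than the line is placed on a line of its own and overflows it; this sketch makes no attempt to split such a word.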

Minimum raggedness

A different algorithm, used in TeX, minimizes the sum of the squares of the space left at the end of each line, to produce a more aesthetically pleasing result. The greedy algorithm above is not optimal in this respect, as the following example text demonstrates:

aaa bb cc ddddd

If the cost function of a line is defined by the remaining space squared, the greedy algorithm would yield a sub-optimal solution for the problem (for simplicity, consider a fixed-width font and line width 6):

------    Line width: 6
aaa bb    Remaining space: 0 (cost = 0 squared = 0)
cc        Remaining space: 4 (cost = 4 squared = 16)
ddddd     Remaining space: 1 (cost = 1 squared = 1)

The costs sum to a total of 17, whereas an optimal solution looks like this:

------    Line width: 6
aaa       Remaining space: 3 (cost = 3 squared = 9)
bb cc     Remaining space: 1 (cost = 1 squared = 1)
ddddd     Remaining space: 1 (cost = 1 squared = 1)

The difference here is that the first line is broken before bb instead of after it, yielding a better right margin and a lower total cost of 11.
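The two totals can be checked with a few lines of Python. The helper name raggedness is an assumption made for this sketch, and widths are again measured in characters of a fixed-width font.

```python
def raggedness(lines, line_width, power=2):
    """Sum over all lines of (remaining space) raised to `power`."""
    return sum((line_width - len(line)) ** power for line in lines)

greedy_layout = ["aaa bb", "cc", "ddddd"]
optimal_layout = ["aaa", "bb cc", "ddddd"]

print(raggedness(greedy_layout, 6))   # 0 + 16 + 1
print(raggedness(optimal_layout, 6))  # 9 + 1 + 1
```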

To solve the problem we need to define a cost function c(i, j) that computes the cost of a line consisting of the words \text{Word}[i] to \text{Word}[j] from the text:

c(i, j) = \left(\text{LineWidth}-(j-i)\cdot\text{OneSpaceWidth}-\sum_{k=i}^j \text{WidthOf}(\text{Word}[k])\right)^P.

Here P is typically 2 or 3. There are some special cases to consider: if the expression in parentheses is negative (that is, the sequence of words cannot fit on a line), the cost needs to reflect the cost of tracking or condensing the text to fit; if that is not possible, the function needs to return \infty.

The cost of the optimal solution can be defined as a recurrence:

f(j) = \begin{cases}
  c(1, j) & \text{if } c(1, j) < \infty, \\
  \displaystyle \min_{1 \leq k < j} \big(f(k) + c(k + 1, j)\big) & \text{if } c(1, j) = \infty.
\end{cases}

This can be efficiently implemented using dynamic programming, for a time and space complexity of O(j^2).[1] Faster but more complicated linear time algorithms are also known.[2][3]
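A minimal dynamic-programming sketch of this recurrence is given below. It is an illustration under simplifying assumptions, not the algorithm as implemented in TeX: the font is fixed-width, the last line is charged like any other line (as in the example above), an overfull line always has infinite cost, and the function name min_raggedness_wrap is hypothetical. The table is indexed from 0 with best[0] = 0 for the empty prefix, which gives the same values as the recurrence for f.

```python
import math

def min_raggedness_wrap(words, line_width, space_width=1, power=2):
    """Return (total cost, lines) minimizing the sum of slack ** power."""
    n = len(words)
    best = [0] + [math.inf] * n   # best[j]: minimal cost of the first j words
    split = [0] * (n + 1)         # split[j]: index where the last line starts
    for j in range(1, n + 1):
        used = -space_width
        # Try every candidate last line words[k:j], growing it backwards.
        for k in range(j - 1, -1, -1):
            used += len(words[k]) + space_width
            if used > line_width:
                break             # a longer line cannot fit either
            total = best[k] + (line_width - used) ** power
            if total < best[j]:
                best[j], split[j] = total, k
    # Walk the split points backwards to recover the lines.
    lines, j = [], n
    while j > 0:
        k = split[j]
        lines.append(" ".join(words[k:j]))
        j = k
    return best[n], lines[::-1]

cost, lines = min_raggedness_wrap("aaa bb cc ddddd".split(), 6)
print(cost, lines)
```

The two nested loops give a quadratic worst-case running time in the number of words. If some word is wider than the line, the returned cost is infinite and the returned lines are not meaningful.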

References

  1. Knuth, Donald E.; Plass, Michael F. (1981), "Breaking paragraphs into lines", Software: Practice and Experience 11 (11): 1119–1184, doi:10.1002/spe.4380111102.
  2. Wilber, Robert (1988), "The concave least-weight subsequence problem revisited", Journal of Algorithms 9 (3): 418–425, doi:10.1016/0196-6774(88)90032-6, MR955150.
  3. Galil, Zvi; Park, Kunsoo (1990), "A linear-time algorithm for concave one-dimensional dynamic programming", Information Processing Letters 33 (6): 309–311, doi:10.1016/0020-0190(90)90215-J, MR1045521.
